Three More Reasons to Embrace Automated Machine Learning
In my IIA post in May I wrote about the fact that automated machine learning (AML or, in deference to anti-money laundering, AutoML) has the potential to reduce demand for data scientists, and to make them less sexy in the job market. Before shutting up completely on that topic, however, I thought I’d make another, perhaps more important point about AutoML: it’s good for your company’s analytics function.
AutoML has the potential to transform not only machine learning, but the practice of analytics in general. The technology typically performs some combination of the following functions:
Automated data preparation;
Automated feature engineering;
Automated competitions among different algorithm types;
Generation of explanations for why certain variables or features are most influential in models;
Creation of program code or APIs for model deployment.
Just about every analytics organization does all or most of these things, and if you’re not doing some, you probably should be. I’ll discuss the benefits of AutoML (beyond reducing the need for highly trained data scientists) in three different categories. I discovered these in a research project on DataRobot customers, so I can only state for certain that they apply with that vendor. However, some of them should generalize to other providers (SAS, H2O, Google Cloud) as well.
MODEL DEVELOPMENT
The greatest productivity advances from AutoML probably come in model development. Some organizations use it only for that purpose, although I would argue that is shortchanging the technology. The core activity that AutoML performs is to consider a large variety of possible algorithms and transformations, and determine which model is the most effective in predicting or explaining variations in your dependent variable. It can also help with feature (variable) selection and engineering, data preparation, and evaluation and comparison of results.
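To make the "competition among algorithms" idea concrete, here is a minimal sketch in plain Python. It is not how DataRobot or any other vendor works internally; the three candidate models, the synthetic data, and names such as `run_competition` and `cross_val_mae` are my own assumptions for illustration. Each candidate is scored by cross-validated mean absolute error, and the leaderboard ranks them from best to worst.

```python
from statistics import mean

# Three deliberately simple, hypothetical candidate "algorithms".
# Each fit function returns a predictor: a function from x to a predicted y.
def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Ordinary least squares for a single feature.
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    # 1-nearest-neighbor: predict the y of the closest training x.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

CANDIDATES = {
    "mean_baseline": fit_mean,
    "linear": fit_linear,
    "nearest_neighbor": fit_nearest,
}

def cross_val_mae(fit, xs, ys, folds=5):
    """Mean absolute error over k folds (rows assigned to folds by index)."""
    rows = list(zip(xs, ys))
    errors = []
    for k in range(folds):
        train = [r for i, r in enumerate(rows) if i % folds != k]
        test = [r for i, r in enumerate(rows) if i % folds == k]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors.extend(abs(model(x) - y) for x, y in test)
    return mean(errors)

def run_competition(xs, ys):
    """Score every candidate and return (name, error) pairs, best first."""
    scores = {name: cross_val_mae(fit, xs, ys) for name, fit in CANDIDATES.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Synthetic data (an assumption): a linear trend plus small alternating noise.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

leaderboard = run_competition(xs, ys)
for name, error in leaderboard:
    print(f"{name}: MAE = {error:.2f}")
```

A real AutoML product searches a far larger space of algorithms, transformations, and tuning settings, but the underlying pattern of fitting many candidates, scoring them on held-out data, and ranking them is the same.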
I interviewed the head of data science support at a large U.S. property and casualty (P&C) insurance company about this. At this firm, modeling productivity was the primary objective in adopting AutoML. He noted that thus far “it has been a very helpful throughput tool.” The insurance giant uses AutoML to get a quick reading on the ROI of alternative machine learning projects. “We get some data, turn DataRobot loose on it, and see what the prediction accuracy is for the model. It’s so quick that we can figure out the value of an analysis without taking a lot of time to assess it,” noted the manager. The company can learn what the key parameters of the model are, what algorithm is best suited to the problem, and what the likely ceiling is on model accuracy. If it seems to be a promising analysis, the company will take it further and perhaps put it into production.
At least one user of AutoML has measured the improvement in modeling productivity. In a pilot using DataRobot, the data science function at Sompo Holdings, a Japan-based global insurance company, found that the time required to develop models in a particular underwriting domain was reduced by 73% (from 13 person-days to 3.5). The predictive ability of the models also increased by about 5% on average. Could your organization benefit from that kind of productivity boost?
MODEL DEPLOYMENT
Deployment (sometimes called “productionizing,” though it’s not an elegant term) of machine learning models is an important component of effective machine learning. It is the process by which analytical models created in the machine learning process are embedded within other systems and processes for purposes of “scoring” cases where the outcome variables are not known. Production systems with machine learning models have to be available at any time, with low latency and high throughput.
Unfortunately, many analytics and machine learning models, perhaps even a majority of them, are never deployed, because the requirements are so difficult and because the part of the organization that does deployment is different from the model development group. Fortunately, there are deployment capabilities within DataRobot, and several other vendors offer versions of them as well. While they may be used less frequently than the model development capabilities, some firms have found considerable value in them. Of the organizations I have come across, 84.51°, a wholly owned analytics subsidiary of Kroger, has perhaps made the most use of the deployment capabilities within AutoML. Scott Crawford, who heads the Embed Machine Learning initiative at the company, pointed out that issues around deployment (or “productionalization,” another version of that term) are often underestimated:
Prior to my current role facilitating the use of machine learning at 84.51°, my work experiences included building and deploying models at one of the nation’s largest insurance companies and one of the world’s largest banks. One commonality across all my experiences is that productionalization is often the most challenging phase of machine learning projects. The requirement of a production deployment often severely constrains the viable solutions. For example, productionalization might require code to be delivered in a specific language (e.g., C++, SQL, Java, etc.) and/or to meet strict latency thresholds.
Automated machine learning tools can help with the deployment process by generating code or APIs that embed the model. 84.51° often outputs Java code for data preprocessing and model scoring. It lifts the code out of its AutoML system and deploys it into a new system; the production system is then freestanding and can compute analytically derived outcomes quickly.
A European insurance firm’s data science leader agreed that outputting Java code was a very useful feature in deployment:
In several client-facing projects we have used the deployment capabilities of DataRobot. You can deploy as an API or export code in Java. We like the Java option better because we are a Java shop and if we have a piece of code we can post somewhere it makes integration with our legacy systems much easier, and there are fewer security challenges compared to external APIs.
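For readers who have not seen exported scoring code, the idea can be sketched in a few lines. The firms above export Java; this sketch uses Python purely for illustration, and the function `export_scoring_code`, the feature names, and the coefficients are all hypothetical rather than anything a vendor actually produces. The point is that a fitted model is turned into freestanding source code with no dependency on the modeling tool, and the production system only needs that source.

```python
def export_scoring_code(feature_names, coefficients, intercept):
    """Return Python source for a dependency-free linear scoring function.

    Hypothetical helper: a real AutoML tool would also export the data
    preprocessing steps and would support far richer model types.
    Assumes at least one feature.
    """
    terms = " + ".join(
        f"{coef!r} * row[{name!r}]"
        for name, coef in zip(feature_names, coefficients)
    )
    return (
        "def score(row):\n"
        f"    return {intercept!r} + {terms}\n"
    )

# "Modeling side": export the fitted model as source code.
# The features and weights here are made-up illustration values.
source = export_scoring_code(["age", "prior_claims"], [0.02, 0.5], 0.1)
print(source)

# "Production side": load the exported source and score a new case.
# Only the exported text is needed, not the tool that produced it.
namespace = {}
exec(source, namespace)
score = namespace["score"]

prediction = score({"age": 40, "prior_claims": 2})
print(prediction)
```

Shipping generated code rather than calling an external API is the trade-off the insurance data science leader describes: integration is a matter of dropping a file into an existing codebase, at the cost of re-exporting whenever the model changes.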
MODEL AND FEATURE EXPLANATION
A final benefit for many organizations using AutoML is model and feature explanation. This feature was originally called “reason codes” in DataRobot and is now called “prediction explanations”; an open source program that provides explanations is LIME, or “Local Interpretable Model-Agnostic Explanations.” It is important because complex analytical or machine learning models can be difficult to interpret. They may involve many different features or variables, and the relative importance of each in prediction or classification may be hard to determine. In some industries such as financial services, model transparency and explainability (for example, for why a customer is extended or denied credit) are required by regulators. In the European Union, the General Data Protection Regulation (GDPR) is widely interpreted as giving anyone affected by an automated decision a “right to an explanation.”
The large U.S. P&C insurance firm’s machine learning group I mentioned above uses this capability extensively. The head of data science support noted, “Reason codes are my favorite feature. To see which features are contributing to the model at what level is extremely valuable. It’s very helpful to explain why a particular customer, for example, is popping up as a likely sale for commercial insurance.”
At a large Canadian bank, model explanations are very helpful to both internal audiences and regulators. The bank’s leader of credit told me that the capability to explain models is “incredible for so many reasons. We sometimes have massive gradient boosted tree models. Variables can be very predictive for small groups. Reason codes let us understand why we make credit decisions, which helps with our regulators. It also helps to dictate actions from some of our marketing models. If we find out that customers who respond to our offers shop at another retailer, we might influence the offer by including a gift card from that retailer. Models can be impenetrable, but the ability to explain them helps a lot.”
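The simplest way to see what a reason code is: for a linear model, each feature's contribution to one customer's score is its coefficient times how far that customer's value sits from a baseline. The sketch below ranks those contributions. It is an illustration only, not DataRobot's method and not LIME (which fits a local surrogate model around each prediction); the function `reason_codes`, the feature names, and every number are assumptions I made up.

```python
def reason_codes(coefficients, baseline, row, top_n=2):
    """Rank features by their contribution to one prediction.

    Contribution = coefficient * (customer's value - baseline value).
    This decomposition is exact only for a linear model; tree ensembles
    and other complex models need approximation techniques instead.
    """
    contributions = {
        name: coefficients[name] * (row[name] - baseline[name])
        for name in coefficients
    }
    ranked = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return ranked[:top_n]

# Hypothetical commercial-insurance propensity model (all values invented).
coefficients = {"revenue_musd": 0.04, "years_in_business": -0.01, "num_locations": 0.15}
baseline = {"revenue_musd": 5.0, "years_in_business": 12.0, "num_locations": 2.0}
customer = {"revenue_musd": 20.0, "years_in_business": 10.0, "num_locations": 5.0}

codes = reason_codes(coefficients, baseline, customer)
for name, contribution in codes:
    print(f"{name}: {contribution:+.2f}")
```

In this made-up case the customer shows up as a likely sale mainly because of revenue and number of locations, which is the kind of statement a salesperson or a regulator can act on.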
These three activities are critical to the success of every organization undertaking analytical or machine learning work. To be able to do them more productively and effectively is a major benefit to the field. Eventually I suspect that virtually every company will be using these tools. Why not start now and get a competitive advantage in your use of data and analytics?