In an earlier post, I talked about predictive model deployment with Spark. That post mainly addressed how to persist a trained model and serve it online to users in real time. However, there are many other concerns when one sets out to build a predictive model system, especially around the robustness and reliability of the system, since predictive models can go terribly wrong without a single error message in the log.
In this post, I am going to discuss what I’d like to call the “continuous deployment” aspects of a predictive model system. Specifically, I will cover the “test suites”, i.e., the offline evaluation and A/B testing components around model performance, and the monitoring/alerting components around feature generation and model validity.
The concepts of offline evaluation and A/B testing shouldn’t be new to most people with some modeling background, since both are widely adopted techniques. Here I am going to focus on how and why they are incorporated into a predictive model system.
I would compare the offline evaluation and A/B testing components in a predictive model system to the tests (unit tests, integration tests, and so on) in a general software system, simply because both need to run and pass before anything can actually be deployed to production.
A simplified model deployment checklist is:
- model development
- offline evaluation w.r.t. model performance
- A/B testing w.r.t. business metrics
- pick the best model
- model deployment
The offline evaluation consists of performance scores, e.g., AUC and/or F1 score, that need to pass a certain threshold. These scores are calculated from the training/test dataset and serve as an approximate evaluation of the true model performance. Offline evaluation is not as accurate as A/B testing, but it speeds up development iterations significantly, since an A/B test usually takes days to show a significant result, if it shows one at all. I would recommend setting a relatively loose threshold for offline evaluation to filter out the obvious losers and pass a set of “good” models on to A/B testing.
The A/B testing serves as the gatekeeper by running two or more models simultaneously. It evaluates a model’s performance with real users based on business metrics. In most cases, one would expect the A/B testing results to be consistent with the offline evaluation scores. That is, the model with the better offline scores would also win in A/B testing. However, there are times when the results differ. In those cases, we should, of course, trust the A/B testing result, but one should also take a step back and think about the reasons behind the discrepancy. My experience tells me that, in those cases, there is likely to be something pretty informative for improving the next version of the model.
These two “tests” help to foster a relatively fast model iteration speed and ensure that a new model is a true improvement. They are both important steps toward automated deployment with confidence.
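As a rough illustration of such an offline gate, here is a minimal sketch in plain Python. The function names, the 0.5 classification cutoff, and the default thresholds (`min_auc`, `min_f1`) are all assumptions made up for this example, not values I am recommending; in practice one would likely use a library implementation of the metrics.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum (Mann-Whitney U) formulation; labels are 0/1."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Group tied scores and give each member the average 1-based rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def f1_score(labels: Sequence[int], scores: Sequence[float],
             cutoff: float = 0.5) -> float:
    """F1 after turning scores into 0/1 predictions at `cutoff` (assumed)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_offline_gate(labels: Sequence[int], scores: Sequence[float],
                        min_auc: float = 0.7, min_f1: float = 0.5) -> bool:
    """Loose filter: only models clearing both thresholds go on to A/B testing."""
    return auc(labels, scores) >= min_auc and f1_score(labels, scores) >= min_f1
```

A candidate model would be scored on the held-out test set and only promoted to the A/B test if `passes_offline_gate` returns `True`. Keeping the thresholds as parameters makes it easy to keep the gate deliberately loose.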
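To make the idea of a significant A/B result a bit more concrete, the sketch below assumes the business metric is a simple conversion rate and applies a standard two-sided two-proportion z-test. That choice of metric and test is my assumption for illustration; other business metrics (revenue per user, for example) would call for a different test. The function name and arguments are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the conversion-rate difference B - A.

    conv_* are conversion counts, n_* are the number of users in each arm.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With model A as the current production model and model B as the candidate, a positive `z` with a small `p_value` suggests B converts better. The significance level to act on is a judgment call and should be fixed before the test starts.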
In general, good monitoring/alerting setups help to identify possible defects in a system and to analyze the root cause when things go wrong. For a predictive model system, they are arguably more critical and need more components than the typical setup for a website. That is, besides the monitoring/alerting on processing time, responsiveness and uptime, we need tracking systems in place for feature generation and model performance.
In a predictive model system, features are usually either generated by periodic offline jobs or read directly from the production database (see this post). Every deployed model was trained on the data available at the time of model fitting. Therefore, for models to perform properly, the distribution of feature values should be consistent between training data and production data. This seems to be a safe assumption, since, in theory, training data is just a sample of today’s production data. However, in the real world, there are at least two ways things could go wrong right under your nose.
For one possibility, the feature generation jobs may produce erroneous results. This may be due to bugs in the job itself or in upstream data sources. This type of error is close to errors in any software system, and we could use similar approaches to set up monitoring/alerting. In addition, it is helpful to track basic statistics of feature values such as mean, median and variance, in case something less obvious is off (I ran into an instance where all features were generated as 0’s without triggering the failed-job alert, since the job ran “successfully” with erroneous numerical calculations).
Another possibility is that the feature distribution itself may change significantly over time. This could be caused by changes in user behavior driven by product changes, or by other macro changes not directly related to the model itself. Because there is no error per se in such cases, this type of distributional change is much harder to detect, let alone fix, but it will inevitably compromise the model performance. To catch these unexpected changes, we could track basic statistics of feature values and/or monitor the model performance online to make sure it stays reasonable. In order to evaluate online model performance, a separate offline job is likely needed to gather predicted values versus true outcomes and compute model performance using the same metrics as the offline evaluation. The computed metrics are then persisted somewhere as a time series and used for monitoring/alerting.
With these additional monitoring/alerting components, one would be more confident in an automated predictive model system and sleep better at night, if not paged by the alerts ;-).
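Here is a minimal sketch of that kind of basic-statistics check. It compares the summary of the latest batch of one feature against a baseline captured at training time. The function names and the `max_shift_in_std` threshold are hypothetical and would need tuning per feature.

```python
import math
import statistics
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Basic statistics to record for one feature in one batch."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
    }


def feature_alerts(baseline: Dict[str, float], current: Dict[str, float],
                   max_shift_in_std: float = 0.5) -> List[str]:
    """Return human-readable alerts; an empty list means nothing looks off.

    `max_shift_in_std` is an assumed threshold: how far the mean or median
    may move, measured in baseline standard deviations.
    """
    alerts = []
    # Catches the "every value is 0" failure: the job succeeds, data is dead.
    if current["variance"] == 0 and baseline["variance"] > 0:
        alerts.append("variance collapsed to 0 (feature is constant)")
    base_std = math.sqrt(baseline["variance"])
    for name in ("mean", "median"):
        shift = abs(current[name] - baseline[name])
        if base_std > 0 and shift / base_std > max_shift_in_std:
            alerts.append(
                f"{name} moved from {baseline[name]:.4g} to {current[name]:.4g}"
            )
    return alerts
```

The feature generation job would call `summarize` on each feature after every run, persist the result, and page someone if `feature_alerts` returns anything. A batch of all zeros would trip the variance check even though the job itself exited cleanly.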
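The offline job described above could be sketched roughly as follows. The record layout (a list of `(day, request_id, score)` tuples for logged predictions and a dict of `request_id` to 0/1 outcome) is an assumption for illustration, as are the function names and the `max_drop` threshold. The `auc` helper is a simple pairwise version included only to keep the example self-contained; any offline evaluation metric with the same signature could be passed in instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def daily_metric(predictions: Sequence[Tuple[str, str, float]],
                 outcomes: Dict[str, int],
                 metric: Callable[[List[int], List[float]], float]
                 ) -> Dict[str, float]:
    """Join logged predictions with true outcomes and score each day."""
    by_day = defaultdict(lambda: ([], []))
    for day, request_id, score in predictions:
        if request_id in outcomes:  # the true outcome may not be known yet
            labels, scores = by_day[day]
            labels.append(outcomes[request_id])
            scores.append(score)
    return {day: metric(labels, scores)
            for day, (labels, scores) in sorted(by_day.items())}


def days_below(series: Dict[str, float], baseline: float,
               max_drop: float = 0.05) -> List[str]:
    """Days whose online metric fell more than `max_drop` under the baseline."""
    return [day for day, value in series.items() if value < baseline - max_drop]
```

Here `baseline` would be the score the model achieved in offline evaluation, and the dictionary returned by `daily_metric` is what gets persisted as a time series. Any day returned by `days_below` is a candidate for an alert.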
Automation comes with the risk of breaking things.
In this post, I attempted to cover a few components for building reliable predictive model systems, and I hope you find it helpful. Since I have only just started building this type of system, I am sure there are many pieces still missing.
I would really appreciate your thoughts/comments here. Feel free to leave them following this post or tweet me @_LeiG.