September 2, 2021

Detect and solve quality issues in AI systems

Karel Vanhoorebeeck

Raymon helps teams detect and solve quality issues in AI systems: it is a monitoring and observability platform for AI-based systems. Monitoring tools help engineers find out when something is wrong; observability tools help them find out why. In essence, observability platforms give engineers more insight into their systems, which is crucial for being able to maintain them in a cost-effective manner. In this blog post we dig a bit deeper into why observability tooling is crucial for the successful adoption of AI-based systems.

AI vs traditional software

Over the last decade, AI has proven capable of impressive feats. That is why countless businesses have already experimented with AI to assess whether it can optimize their business processes or improve their products. This period of experimentation is still in full swing, but more and more businesses have started to deploy AI-based systems and to depend on AI for their operations. Just as software has become ubiquitous over the last decades, from giant projects to embedded devices, AI is becoming ubiquitous too, taking all kinds of forms and shapes, from huge complex AI systems to small embedded smart devices.

However, just as the move to software came with its challenges, so too does the move toward AI come with difficulties that need to be overcome. We will highlight three challenges specific to AI systems that are of particular importance for teams building and maintaining such systems.

Challenge 1: Assessing quality is hard

In traditional software, a system is generally monitored by tracking things like request latency, CPU usage and uptime. For AI systems, one also needs to monitor the correctness of the system: models make predictions, and we need to verify that these predictions are (and remain) of good enough quality. This is often far from straightforward.

In the simplest case, correctness can be measured directly using delayed incoming actuals: the true outcomes that arrive some time after the predictions were made. In such cases, one can invest some effort to set up a system that combines model predictions and actuals into performance metrics that can be tracked. Strangely enough, no out-of-the-box tooling seems to be available for this, so teams need to build it themselves.
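To make this concrete, here is a minimal sketch of what such a predictions/actuals join might look like. All names are illustrative; a real system would run this over a time window against a log store rather than in-memory lists:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    request_id: str
    predicted: int

@dataclass
class Actual:
    request_id: str
    actual: int

def accuracy_from_actuals(predictions, actuals):
    """Join predictions with delayed actuals by request id and compute accuracy.

    Predictions for which no actual has arrived yet are simply skipped.
    """
    actual_by_id = {a.request_id: a.actual for a in actuals}
    matched = [(p.predicted, actual_by_id[p.request_id])
               for p in predictions if p.request_id in actual_by_id]
    if not matched:
        return None  # nothing to evaluate yet
    correct = sum(1 for pred, act in matched if pred == act)
    return correct / len(matched)
```

The metric produced here would then be pushed to whatever time-series store the team uses for tracking and alerting.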

In most cases however, there are no delayed actuals coming in. In those cases, one can try to label a fraction of production data periodically to assess performance, which usually involves a lot of manual work.

Alternatively, although not ideal, engineers can extract data and prediction quality metrics (or heuristics) to try to assess their system's data health and model performance. Exactly which metrics these should be is not always straightforward, especially since computing them should be cheap if they are going to be computed for every request. Moreover, measuring data quality often involves many different metrics (a few metrics per input feature, for example), leading to an explosion in the number of metrics to be tracked and the number of alerts generated.
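The metric explosion is easy to see: even a toy sketch like the one below (illustrative names, assuming tabular inputs) already produces three tracked series per feature, so a model with fifty features yields a hundred and fifty series to monitor and alert on:

```python
import math

def feature_health_metrics(rows, feature_names):
    """Compute a few cheap health metrics per input feature.

    Even with only three metrics per feature, the number of tracked
    series grows linearly with the feature count.
    """
    metrics = {}
    for name in feature_names:
        values = [r.get(name) for r in rows]
        present = [v for v in values
                   if v is not None and not (isinstance(v, float) and math.isnan(v))]
        metrics[f"{name}.missing_rate"] = 1 - len(present) / len(values)
        metrics[f"{name}.mean"] = sum(present) / len(present) if present else None
        metrics[f"{name}.min"] = min(present) if present else None
    return metrics
```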

Even when we have metrics or heuristics measuring data health or model performance, metrics only convey limited information. Taking into account the dynamic nature of AI systems (see below), it is of paramount importance that AI engineers can easily debug predictions made by their systems to increase their understanding of the system's behavior. This is especially the case in more complex processing pipelines involving multiple models or extensive pre- and post-processing. This is where debuggability (or accountability) comes in. In AI, this involves easy access to input and output data, as well as to partial results. Inspecting data can involve all kinds of visualizations: simply plotting an input image, plotting an output segmentation map, applying explainability techniques and so on. These visualizations may happen in Jupyter notebooks, but being able to query the relevant data easily remains paramount.
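The prerequisite for this kind of debugging is that all artefacts of a request are stored under one shared id so they can be queried later. A rough sketch of that idea, assuming a simple JSON Lines file as the store (a real deployment would ship these records to a proper backend):

```python
import json
import time
import uuid

def log_prediction_trace(path, inputs, partials, output):
    """Append one structured record per prediction request.

    Storing inputs, intermediate results and the final output under a
    shared request id makes it possible to fetch and visualize any
    single prediction later when debugging.
    """
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "inputs": inputs,
        "partials": partials,  # e.g. per-model outputs in a multi-step pipeline
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["request_id"]
```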

Figure 1: A simple visualization can be much more informative than a metric to debug issues. Here, clearly something is wrong with the camera taking the picture since there is a lot of blur.

In short, when compared to traditional software, AI systems require a much bigger investment in monitoring and troubleshooting infrastructure in order to be able to guarantee their quality.

Challenge 2: AI systems are heterogeneous, not homogeneous, and with a long tail

Traditional software is hand-written and solves one specific task, like sending an email when a button is clicked or storing a file in a database when it is uploaded. For every task that needs to be tackled, explicit code has to be written.

The power of AI is that it can be trained, not explicitly programmed. The advantage of this is that one can create programs that solve multiple similar tasks instead of just one. In AI systems, the input data determines what exactly the task is that needs to be solved and, given a dataset, one can teach a computer to deal with multiple different, yet related, versions of the same task. These different versions will group together in different "scenarios". The dataset used for training is then nothing more than a representation of the scenarios you want your solution to cover, and your trained model is the program that solves the task for those scenarios. These scenarios are often called subsets, segments or slices.

Figure 2: a dataset is a collection of scenarios or slices. A model will perform differently on different slices. Having insight into which slices underperform is crucial for focussed model improvement.

A lot of AI use cases are easy to solve for some scenarios, but suffer from a very long tail of alternative scenarios that are hard to get right. Self driving cars are probably the best example of an application with a very long tail of possible scenarios the model needs to be able to deal with. Just try to imagine how many different situations can occur when driving a car all over the world.

Luckily, most applications are simpler than self-driving cars, but for anything beyond very simple tasks, AI teams need fine-grained insight into how well their models perform on different subsets of data. For example, a model may perform very well on average, but if it serves different user demographics, the team needs to be sure it serves every demographic with more or less equal quality.
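Computing a per-slice breakdown is conceptually simple; the hard part is doing it continuously in production. A minimal sketch, with illustrative record and field names:

```python
from collections import defaultdict

def accuracy_per_slice(records, slice_key):
    """Break overall accuracy down per slice (e.g. per user demographic).

    `records` are dicts carrying a slice attribute plus the predicted
    and actual labels for one request.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        s = r[slice_key]
        totals[s] += 1
        hits[s] += int(r["predicted"] == r["actual"])
    return {s: hits[s] / totals[s] for s in totals}
```

A model that looks fine on aggregate accuracy can still score near zero on a small slice; only a breakdown like this surfaces that.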

To make sure AI systems deliver adequate performance to their users, having slice-centric insight into how well a model performs on every slice of data is essential. However, this is a new concept that is lacking in current tooling.

Challenge 3: Systems are impacted by outsiders and always changing

Traditional software runs in a more or less controlled environment where the interactions with the outside world are explicitly defined. Interaction with software typically involves some kind of button to push, which triggers an action that triggers other actions, and so on. In short: software processes actions. AI systems, on the other hand, do not process actions but data, which means their output is determined by the data they receive. Data is both the reason for existence and the weakness of AI systems: when something is wrong with the data, they will not perform well. Garbage in, garbage out.

The data an AI system processes may originate outside your controlled environment and is a representation of the state of the context it is generated from. For example, your input data may be security camera images, which capture the state of the area the camera monitors. Moving the camera, replacing it or altering the supervised area in some way may impact your data quality and AI performance. Alternatively, your input data may be user-generated text, which might reflect recent events or currently popular words. If any of those change, your model's performance may be impacted. Another example is input data queried from an API or database maintained by other teams; this data then represents the state of those teams' systems. When those teams update their systems, data distributions or ranges may change unexpectedly, and with them your input data.

Figure 3: AI systems may serve multiple clients or segments. One client updating their system may impact data quality without anyone noticing. AI teams need monitoring to get alerted to this!

All of this means that changes in your AI's context will impact your system. Changes may reduce data integrity, cause feature distributions to shift (data drift) or even cause your desired output to change (concept drift). These changes may be gradual or sudden, but data is constantly evolving, and so should AI systems be. In traditional software, the maintainers determine when to roll out updates. In AI systems, they are forced to do so whenever the context their AI operates in changes.
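One common heuristic for catching data drift on a single feature is the Population Stability Index (PSI), which compares how a feature's values distribute over bins in a reference (training) sample versus a production sample. A rough pure-Python sketch, where the binning choices and the usual "PSI above 0.2 is worth investigating" threshold are conventions, not hard rules:

```python
import math

def psi(reference, production, bins=10):
    """Population Stability Index between a reference sample and a
    production sample of one numeric feature. Higher means more drift;
    a common rule of thumb treats values above 0.2 as significant.
    """
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0] = float("-inf")   # catch production values below the training range
    edges[-1] = float("inf")   # ...and above it

    def fractions(sample):
        counts = [0] * bins
        for v in sample:
            for i in range(bins):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # small floor avoids log(0) for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, prod = fractions(reference), fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

An unchanged distribution yields a PSI near zero; a shifted one pushes it well past the alerting threshold.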

AI teams should stay on top of this. Their job is to provide a service of constant quality while depending on other people who might have no idea how this service works internally, or how their daily actions might influence it. Continuous human oversight, easy insight into your production systems and frequent model iteration are crucial for this.

AI is not “train once, deploy forever”.

Since AI systems only deliver value post-deployment, maintaining the performance of AI models is crucial. However, in many companies small teams are responsible for maintaining multiple models, each serving multiple segments of data and each solving tasks that may constantly evolve, while lacking good supportive tooling to do so. As a result, teams struggle to maintain and improve the quality of AI systems. According to Dimensional Research, 70% of companies need to keep allocating more than 50% of the initial resources to maintain an AI/ML project.

This challenge is also what Christopher Ré et al. refer to when they state:

maintaining and improving deployed models is the dominant factor in their total cost and effectiveness–much greater than the cost of de novo model construction

Limitations of current tooling

Based on our experience and market research, we observed that teams generally take one of two approaches to deal with the problems above. Most frequently, teams stitch together tooling, either originating from the (open source / open core) DevOps sphere or developed in-house. Frequent examples are the Elastic stack to store (structured) text logs and set up some dashboards, storing incoming data on S3, and storing output metrics in databases. Teams then need to interact with different tools to perform data inspection, monitoring and debugging.

We believe that this setup is suboptimal, to say the least. First of all, a lot of effort is required to develop, set up and maintain those tools, meaning they are actually much more expensive than they may appear at first sight. Moreover, these tools are scattered and cumbersome to use, leading to a bad user experience, which in turn leads to poor insights and suboptimal service quality. Also, there is no alerting tuned for AI systems that can deal with the huge number of metrics, and there is generally no support for slice-based metric tracking.

The other option is to use platforms like AWS SageMaker, Azure ML Studio, Google AI Platform or Databricks. All of these players are investing in tools to monitor your AI systems. However, these platforms are complex, have a steep learning curve, require complete buy-in, are expensive to use, and are rather rigid and focussed on specific corporate use cases. For many teams these platforms are not a good choice, since they will lead to more frustration than benefit.

We at Raymon believe that the challenges mentioned above merit specific tooling. Raymon offers a fully featured monitoring and observability platform, specifically tailored for AI systems. The platform is very extensible, requires little configuration and can be easily integrated.

Model performance monitoring and accountability: Raymon is a model evaluation hub. Users can log all relevant data, metadata and metrics related to a model prediction request to our backend. Raymon groups all data belonging to a certain request together and lets teams easily fetch anything logged to our system. Using so-called actions, any type of visualization of that data can be made to help teams understand the internals of their system.
Additionally, teams can track model performance with Raymon, either by logging a metric that measures performance to our system, or by using actions, which can combine multiple data artefacts into model performance metrics. This can save teams significant development work.

Slice-based insights: Raymon allows teams to monitor multiple slices of data, benchmark them against each other using any metric, and alerts them if slices show reduced performance.

Data monitoring: Raymon helps teams extract valuable metrics from their data and models, and helps them set up monitoring for those metrics. Using those metrics, Raymon helps teams monitor performance, data drift, data quality, data novelty and more. All of this can be achieved with very little manual configuration.

In an upcoming post we'll describe the Raymon platform in more detail. In the meantime, please feel free to check out our open source data validation library and send us any questions you might have!