Imagine investing considerable resources in an AI-powered solution, only to watch its performance decline over time without knowing why.
This decline is often caused by data drifts and shifts, phenomena that can plague even the most well-designed machine learning models.
As the environment in which your AI model operates evolves, its capacity to make precise predictions and provide dependable results can falter, presenting a challenge for businesses and technical managers alike.
While focusing on research and early proof-of-concept stages might often seem like a priority, it’s crucial to recognize that data drifts and shifts will likely emerge at some point, impacting your AI application’s performance.
By understanding the different types of drifts and how they relate to your model’s environment, you’ll be better equipped to take action and maintain the efficiency of your AI solution.
In this article, I’ll delve into the world of data drifts and shifts, shedding light on these often-overlooked factors that can significantly affect your AI application’s performance.
Key takeaways:
- Data drifts and shifts can significantly impact AI model performance.
- Different types of drifts and shifts include concept drift, label shift, covariate shift, feature change, and label schema change.
- Effective drift and shift detection is key to maintaining AI solution accuracy.
- Monitoring data distributions and model performance metrics can help detect and address drifts and shifts.
- Periodic retraining using newly gathered, labeled data is an effective way to tackle drifts and shifts.
Different Types of Data Drifts and Shifts: A Comprehensive Breakdown with a Waste Segregation Example
In the world of AI, data drifts and shifts can have significant impacts on the performance of machine learning models.
These phenomena can influence both input variables and model targets, ultimately impacting the accuracy and effectiveness of your AI solution. To better understand these types, let’s explore each one in detail using the waste segregation example.
In this example, we are developing an AI solution for automatic waste segregation that recognizes four types of waste: “mixed”, “glass”, “plastic” and “paper”. The model sorts waste based on photos, deciding which waste bin each item belongs in.
One more thing before we start. While reading this article, you may be curious about the nuances between drift and shift and whether they can be used interchangeably.
Typically, drift denotes gradual changes in data distribution over time that affect machine learning model performance. Shift, by contrast, refers to a change in the input or output data distribution regardless of how quickly it happens. You will encounter both terms in resources that describe these phenomena.
Concept Drift
Concept drift happens when, for a given input, the real-world output differs from what the model has been trained to predict. This occurs because the underlying relationship between the input and output has changed, and the machine learning model is unable to adapt to the new reality.
In the waste segregation example, concept drift would occur if, due to a change in regulation, wet paper should now be placed in the paper bin instead of the mixed one. Although the input (photo of wet paper) remains the same, the correct output class changes from the mixed bin to the paper bin.
Label Shift
Label shift, also known as target shift, occurs when the distribution of the target variable (in our case, waste types) changes relative to the training dataset.
The model still predicts the same output for a given input, but the class frequencies it learned during training no longer match reality, which can reduce its accuracy on new data.
In the waste segregation example, a label shift could occur if there were a change in habits that led to an increase in the percentage of glass and paper packaging waste, while the percentage of plastic waste decreased substantially.
As a result, the model would be operating on a target distribution different from the one it was trained on, potentially causing it to lose accuracy in predicting waste types.
Covariate Shift
Data drift refers to changes in the data distribution. A specific type of data drift, called covariate shift, occurs when the relationship between a given input and its output remains constant, but the input data distribution changes.
This can happen if the training data distribution doesn’t accurately represent the input data distribution during the model’s working conditions.
Covariate shifts can be caused by various factors, including changes in trends, regulations, or market conditions.
In the waste segregation example, a covariate shift would occur if the appearance, colors, and visual characteristics of discarded packaging changed over time. The relationship between the input (photo) and the output (waste type) would stay the same, but the input data distribution would be different.
Other Types of Data Drift
There are several other types of data drift that can impact model performance:
Feature Change: Feature change occurs when a new feature is introduced, an old feature is removed, or the value range, classes, or coding of a feature changes.
For instance, in the waste segregation example, a feature change would happen if we connected our model to a camera that only captures images in grayscale, resulting in the loss of color information.
Label Schema Change: Label schema change happens when the possible range of values for the model output changes.
In the waste segregation scenario, this would occur if we added a new type of waste, such as “bio”, to the set of recognized waste types.
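Both of these changes can often be caught with a plain sanity check before any statistics are involved. Below is a minimal sketch, assuming the model was trained on three-channel (RGB) photos described by a (height, width, channels) shape; the function name, the constants, and the shape convention are illustrative assumptions, not part of any particular library.

```python
# Hypothetical schema check for the waste segregation example.
EXPECTED_LABELS = {"mixed", "glass", "plastic", "paper"}
EXPECTED_CHANNELS = 3  # assumption: the model was trained on RGB photos


def check_schema(image_shape, observed_labels):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    # Assumed shape convention: (height, width, channels); grayscale may be (height, width).
    channels = image_shape[2] if len(image_shape) == 3 else 1
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"feature change: expected {EXPECTED_CHANNELS} channels, got {channels}"
        )
    # Any label outside the training label set signals a label schema change.
    unknown = set(observed_labels) - EXPECTED_LABELS
    if unknown:
        problems.append(f"label schema change: unknown labels {sorted(unknown)}")
    return problems


for problem in check_schema((224, 224), ["glass", "bio"]):
    print(problem)
```

A check like this is cheap enough to run on every incoming batch, although it only covers changes you thought to encode in advance.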
The Impact of Data Drifts on Model Performance and How to Detect and Address Them
Data drifts and shifts can lead to a decline in model performance, affecting the overall efficiency of your AI solution.
While some changes in data distributions might not always result in performance degradation, it’s important to understand how these drifts can impact your app and the best practices to detect and address them.
Impact on Model Performance
The degree to which model performance is impacted depends on the scale and type of data drift.
For example, rapid concept drift or label schema change (e.g., due to legal regulations) might make the model non-functional in solving a given problem.
More often, however, changes are gradual, which makes them harder to detect: slow changes are easily overlooked while the model’s performance steadily declines.
Detecting Data Drifts
Detecting drifts can be difficult, and not every drift will necessarily harm model performance. So it’s good to focus on situations where performance metrics decrease or odd anomalies occur.
Covariate Shift and Feature Change Detection (Without Labels)
In cases of changes in input data distribution, you can act without labels for the data collected during the model’s operation (referred to as operational data).
Observing input data distribution statistics from both the training set and the operational data can help, but it might not be enough to detect drifts.
Two-sample statistical tests, such as the Kolmogorov-Smirnov (K-S) test, can be used to compare variable distributions and assess significant differences.
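To make the K-S idea concrete, here is a minimal sketch in plain Python. It computes the two-sample K-S statistic (the largest gap between the two empirical distribution functions) and compares it with the usual large-sample critical value. The monitored feature (mean photo brightness) and the synthetic numbers are assumptions made up for illustration; in practice a statistics library such as SciPy offers a ready-made two-sample K-S test with a proper p-value.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample K-S test."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def distribution_changed(reference, operational, alpha=0.05):
    """True if the K-S statistic exceeds the critical value at level alpha."""
    d = ks_statistic(reference, operational)
    return d > ks_critical_value(len(reference), len(operational), alpha)


# Synthetic example: mean brightness of photos (hypothetical feature).
rng = random.Random(0)
train_brightness = [rng.gauss(0.55, 0.10) for _ in range(500)]
darker_photos = [rng.gauss(0.45, 0.10) for _ in range(500)]

print("K-S statistic:", round(ks_statistic(train_brightness, darker_photos), 3))
print("Shift flagged:", distribution_changed(train_brightness, darker_photos))
```

The test works on one variable at a time, so for images it has to be applied to derived features (brightness, color statistics, embedding dimensions) rather than to raw pixels.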
Additionally, you can train a classifier to distinguish between the training set and the operational data. If it separates the two clearly better than chance, a covariate shift is likely.
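The classifier approach can be sketched as follows. To stay dependency-free, this example uses a very simple nearest-centroid classifier on hypothetical three-number features (for instance, the mean red, green, and blue values of a photo); any stronger classifier could be substituted. The function names, the 50/50 split, and the synthetic data are assumptions for illustration only.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def domain_classifier_accuracy(train_rows, operational_rows, seed=0):
    """Holdout accuracy of a classifier that guesses which set a row came from.

    Accuracy near 0.5 means the two sets look alike; accuracy well above 0.5
    suggests the input distribution has changed. Assumes both sets are large
    enough that each appears in the fitting half.
    """
    rng = random.Random(seed)
    labeled = [(r, 0) for r in train_rows] + [(r, 1) for r in operational_rows]
    rng.shuffle(labeled)
    half = len(labeled) // 2
    fit, holdout = labeled[:half], labeled[half:]
    c_train = centroid([r for r, y in fit if y == 0])
    c_oper = centroid([r for r, y in fit if y == 1])
    correct = sum(
        (1 if sq_dist(r, c_oper) < sq_dist(r, c_train) else 0) == y
        for r, y in holdout
    )
    return correct / len(holdout)


def sample_rows(rng, means, count, spread=0.1):
    """Draw synthetic feature vectors around the given per-feature means."""
    return [[rng.gauss(mu, spread) for mu in means] for _ in range(count)]


rng = random.Random(1)
train_rows = sample_rows(rng, [0.5, 0.5, 0.5], 300)
similar_rows = sample_rows(rng, [0.5, 0.5, 0.5], 300)
shifted_rows = sample_rows(rng, [0.7, 0.5, 0.4], 300)  # packaging colors changed

print("similar:", domain_classifier_accuracy(train_rows, similar_rows))
print("shifted:", domain_classifier_accuracy(train_rows, shifted_rows))
```

Measuring accuracy on a held-out half matters here: a classifier evaluated on the data it was fitted to can look better than chance even when the two sets come from the same distribution.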
Concept Drift and Label Shift Detection (With Labels)
Labeled operational data collected during the model’s operation is crucial for detecting these drifts.
You can check if the model’s performance decreases in production or if there are anomalous conditions.
Statistical tests comparing target distributions from the training set and operational data can also be performed.
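One such test is a chi-square goodness-of-fit test on the label counts. The sketch below treats the class proportions of the training set as the reference and checks whether the labeled operational sample deviates from them. The counts are invented for illustration, and the hard-coded critical value 7.815 corresponds to 3 degrees of freedom (four classes) at a 5% significance level; a different number of classes needs a different value.

```python
from collections import Counter

CLASSES = ["mixed", "glass", "plastic", "paper"]
CHI2_CRITICAL_DF3_ALPHA05 = 7.815  # chi-square critical value, 3 d.o.f., alpha = 0.05


def chi_square_statistic(train_labels, operational_labels, classes=CLASSES):
    """Goodness-of-fit statistic of operational label counts vs. training proportions.

    Assumes every class occurs in the training labels (expected counts > 0).
    """
    train_counts = Counter(train_labels)
    oper_counts = Counter(operational_labels)
    n_train, n_oper = len(train_labels), len(operational_labels)
    stat = 0.0
    for c in classes:
        expected = n_oper * train_counts[c] / n_train
        stat += (oper_counts[c] - expected) ** 2 / expected
    return stat


# Invented label counts for illustration.
train_labels = ["mixed"] * 400 + ["glass"] * 200 + ["plastic"] * 250 + ["paper"] * 150
oper_labels = ["mixed"] * 100 + ["glass"] * 80 + ["plastic"] * 30 + ["paper"] * 40

stat = chi_square_statistic(train_labels, oper_labels)
print("chi-square statistic:", round(stat, 2))
print("label shift flagged:", stat > CHI2_CRITICAL_DF3_ALPHA05)
```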
To detect concept drift without labels, you can use Statistical Process Control methods like CUSUM, EWMA, or control charts, but their unsupervised nature might not be sufficient in many cases.
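As an illustration of the CUSUM idea, the sketch below monitors a label-free signal and raises an alarm when it drifts downward. The choice of signal (the model's average top-class confidence per batch) and the values of the target mean, slack, and threshold are assumptions picked for this example; as noted above, such a signal may not reflect concept drift at all.

```python
def cusum_first_alarm(values, target_mean, slack, threshold):
    """One-sided CUSUM for a downward shift.

    Accumulates how far each value falls below target_mean (minus a slack
    allowance) and returns the index of the first value at which the
    cumulative sum exceeds the threshold, or None if it never does.
    """
    s = 0.0
    for t, x in enumerate(values):
        s = max(0.0, s + (target_mean - x - slack))
        if s > threshold:
            return t
    return None


# Hypothetical per-batch average confidence of the waste classifier.
stable = [0.90, 0.91, 0.89, 0.90, 0.92, 0.88] * 5
drifting = stable + [0.80] * 10  # confidence drops after batch 30

print(cusum_first_alarm(stable, target_mean=0.90, slack=0.02, threshold=0.2))
print(cusum_first_alarm(drifting, target_mean=0.90, slack=0.02, threshold=0.2))
```

The slack keeps ordinary noise from accumulating, while the threshold trades detection delay against false alarms; both would need tuning on historical data.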
Domain Knowledge and Visual Inspection
For label schema change, concept drift, and other types of drift, up-to-date domain knowledge and observation of trends or changes in the model’s operating environment can be useful in detecting factors that may affect the model’s performance.
Addressing Data Drifts
Creating robust models capable of generalizing may seem like the solution, but it is often infeasible or unprofitable for specific tasks.
The most effective method to tackle drifts is periodic retraining using labeled operational data collected during the model’s operation.
Labeling can be labor-intensive, and deciding when to label and how to sample a collection is a separate issue, but it is an effective method for combating most problems.
Depending on the needs, additional methods like data re-labeling (in the case of label schema change or concept drift) or cost-sensitive learning (for label shift) can be employed.
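One simple form of cost-sensitive learning for label shift is to weight each training example by how much more (or less) common its class has become in operation. The sketch below computes such per-class weights from label lists; the function name and the counts are assumptions for illustration, and how the weights are passed to training depends on the framework you use.

```python
from collections import Counter


def class_weights(train_labels, operational_labels):
    """Per-class weight = operational class share / training class share.

    Classes that became more common in operation get weights above 1, so
    their training examples count more during retraining. Assumes the
    operational labels are a representative labeled sample.
    """
    train_freq = Counter(train_labels)
    oper_freq = Counter(operational_labels)
    n_train, n_oper = len(train_labels), len(operational_labels)
    return {
        c: (oper_freq[c] / n_oper) / (train_freq[c] / n_train)
        for c in train_freq
    }


# Invented counts: plastic waste halves, glass and paper increase.
train_labels = ["plastic"] * 40 + ["glass"] * 20 + ["paper"] * 20 + ["mixed"] * 20
oper_labels = ["plastic"] * 20 + ["glass"] * 30 + ["paper"] * 30 + ["mixed"] * 20

weights = class_weights(train_labels, oper_labels)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 2))
```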
Maintaining AI Performance in the Face of Drifts and Shifts
In summary, data drifts and shifts are common occurrences that can negatively impact the performance of AI solutions.
By understanding their different types and staying vigilant, you can detect and address these issues more effectively.
Regularly monitoring your models and leveraging domain knowledge, labeled operational data, and appropriate statistical tests are crucial steps in maintaining optimal performance.
Don’t let drifts and shifts degrade your AI solution’s performance; stay proactive and ensure your models remain up-to-date and relevant.
Some sources and more knowledge:
- “Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications” Chip Huyen
- “Machine Learning for Data Streams: with Practical Examples in MOA” Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer
- “Dataset Shift in Machine Learning” Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence