Reducing Risk in Machine Learning Deployments
9 Ways to Reduce Risk in Industrial Machine Learning Deployments
When developing and deploying machine learning solutions for heavy-industry use cases, it is important to recognise the associated costs, risks, and benefits. In this article, we will discuss nine ways to reduce risk in industrial machine learning deployments.
1. Recognise that Machine Learning is Probabilistic, not Deterministic
IT is deterministic, whereas AI is probabilistic.
The risk profiles are vastly different; do not confuse the two. Machine learning model development is experimental by its very nature: "let's see if this data, plus this algorithm, can deliver this result". If you cannot tolerate the uncertainty that this entails, then machine learning is not for you.
The reward that offsets this high uncertainty is the strategic value yielded by the new insights. Since machine learning models are software-defined, the cost of change is low and the time to value can be fast. The time to value and the final capabilities of a machine learning solution are, however, difficult to know in advance. De-risking can be achieved through delivery in clear increments of value.
Analytical approaches for a particular dataset do not spring from the void; they emerge from the data itself. Since machine learning is data-driven, outcomes are often emergent: the possibilities for improvement are only surfaced through iteration and exposure to the data.
Given the idiosyncratic nature of each dataset, machine learning delivery thrives on iteration. Outcomes depend on patterns and relationships embedded within the data, and such patterns are not always self-evident at the outset of a project. Because outcomes are emergent, empirical, and data-dependent, project stakeholders must recognise that the initial concept is not always the best solution, and thus the initial conception of a project should not be chiselled in stone.
Bootstrap Your Insights
The availability of insights (derived data, cross-referenced data) accumulates incrementally as an analytics project progresses. Each subsequent insight can be derived from, and enriched by, the prior accumulation of insights, so value delivery accelerates as insights accumulate. This inherent acceleration of value delivery by bootstrapping insights is a key driver of the strategic value of machine learning.
2. Improve Data Quality
Data quality is the primary risk driver in machine learning deployments. "Garbage in, garbage out": if the training data quality is low, your model will perform poorly, irrespective of model sophistication. If the data is noisy, incomplete, or contains erroneous values, performance will suffer. Where possible, the focus should be on eliminating quality issues by substituting high-quality data.
If substitution is not possible, then the focus should be on reducing the impact of the quality issues. Ensemble learning is one technique for doing so. An ensemble of "weak learners" is a group of models that are individually weak (i.e. high-bias) but collectively strong, because their differing strengths and weaknesses complement each other. The performance of the ensemble is stronger than that of each weak learner on its own.
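To make this concrete, below is a minimal sketch of a bagged ensemble of weak learners using scikit-learn; the synthetic dataset and hyperparameters are illustrative stand-ins for a real industrial problem.

```python
# A minimal sketch of ensembling weak learners with scikit-learn.
# The dataset and hyperparameters are illustrative, not recommendations.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data standing in for a real industrial dataset.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.15, random_state=0)

# A single shallow tree is a "weak learner": simple and high-bias.
weak = DecisionTreeClassifier(max_depth=2, random_state=0)

# Bagging trains many weak trees on bootstrap samples and averages their votes,
# which dampens the influence of noisy observations.
ensemble = BaggingClassifier(weak, n_estimators=100, random_state=0)

print("single weak learner:", cross_val_score(weak, X, y, cv=5).mean())
print("bagged ensemble:    ", cross_val_score(ensemble, X, y, cv=5).mean())
```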
In industrial scenarios, data can be incomplete, inaccurate, or inconsistent due to a lack of standardisation, data governance, or data quality management. At Traversal Labs, we reduce risk by beginning each project with a data discovery process to determine the quality of the data, and by structuring projects so that continuation is conditional on that quality. The discovery illuminates the state of the data, helping to identify potential issues early in the project lifecycle.
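As an illustration, the sketch below shows the kind of basic quality checks a discovery step might run with pandas; the file name and column names are hypothetical placeholders.

```python
# An illustrative sketch of basic data-quality checks for a discovery step.
# The file name and column names below are hypothetical.
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_by_column": df.isna().mean().round(3).to_dict(),  # fraction missing
    "constant_columns": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
    "time_coverage": (df["timestamp"].min(), df["timestamp"].max()),
}

for key, value in report.items():
    print(f"{key}: {value}")
```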
3. Scale the Data
A lack of scale can also be a risk factor in machine learning deployments. If there is inadequate data, the model may not be able to learn effectively. It is reasonable to expect that performance improves as the amount of data increases (assuming the data is representative of the process being modelled).
Scale your Data Collection Process
Depending on the capture methodology, higher data volumes are typically more representative of the process being modelled, and help to reduce the impact of noise in the data. Increasing the number of data points makes the model more robust and reduces the risk of overfitting.
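One practical way to check whether collecting more data is likely to help is to examine a learning curve. The sketch below uses scikit-learn's learning_curve on synthetic data, purely as an illustration.

```python
# A minimal sketch using scikit-learn's learning_curve to check whether
# collecting more data is likely to improve performance; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# If validation scores are still climbing at the largest training size,
# acquiring more data is a reasonable next step.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} samples -> validation accuracy {score:.3f}")
```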
Reduce the Burden of Human-annotated Labelling
In the case of supervised learning, labelling can become unscalable. Human-annotated labels are expensive and time-consuming, and the cost of collecting them can introduce notable financial risk to machine learning projects. At Traversal Labs, we have developed a selection of techniques to reduce the burden of collecting human-annotated labels by upwards of 95%.
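Semi-supervised learning is one widely used family of techniques for stretching a small labelled set. The sketch below uses scikit-learn's SelfTrainingClassifier as an illustration; it is not a description of our specific techniques.

```python
# A sketch of semi-supervised self-training, one common way to reduce the
# number of human-annotated labels. Unlabelled points are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

# Pretend only 5% of the data has human-annotated labels.
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabelled = rng.random(len(y)) > 0.05
y_partial[unlabelled] = -1

# The base model labels its own most confident predictions and retrains.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)

print("labelled examples used initially:", int((~unlabelled).sum()))
print("accuracy on full ground truth:", accuracy_score(y, model.predict(X)))
```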
4. Maintain Oversight Over the Broader Data Pipeline
Having end-to-end oversight of the data, from the moment of capture (or synthesis) to the moment of analysis, enables the data to be managed and controlled throughout the pipeline.
Control over the data generation process starts at the moment of capture. In an industrial scenario, this includes practical considerations such as sensor selection, sensor placement, and sensor calibration. Enhanced data oversight gives greater control over the integrity and quality of the data, thus reducing the risk of poor model performance due to data quality issues.
5. Eliminate Data Leakage
Data leakage can lead to inflated expectations of model performance and poor generalisation to new data. Leakage between the training and test sets subverts the performance metrics, because evaluation is performed on "leaked" data that has already been seen during training; the results often appear "too good to be true". A model trained on leaked data will perform deceptively well on the test set, but will not generalise to new data and will perform poorly in production.
It is important to consider cues which may violate the independence of the training and test sets. These can include confounding variables, measurement biases, or data points that are not properly identified.
Location and time are common confounding variables. For example, if a model is trained on data from a single location, it may not generalise well to other locations due to an unaccounted-for confounding variable. To alleviate the risk of data leakage, care must be taken to ensure the independence of the training and test sets.
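One common safeguard is to split data along group boundaries such as site or time period. The sketch below uses scikit-learn's GroupShuffleSplit with a hypothetical "site" identifier, so that no location contributes data to both the training and test sets.

```python
# A minimal sketch of splitting by group (here, a hypothetical "site" label)
# so that data from one location never appears in both training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # stand-in feature matrix
y = rng.integers(0, 2, size=1000)      # stand-in labels
site = rng.integers(0, 8, size=1000)   # which plant/location each row came from

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=site))

# No site should appear on both sides of the split.
assert set(site[train_idx]).isdisjoint(site[test_idx])
print("train sites:", sorted(set(site[train_idx])))
print("test sites: ", sorted(set(site[test_idx])))
```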
6. Establish Strong Channels of Communication Between Business and Data Stakeholders
Clear channels of communication must be opened between technical and business stakeholders. Such communication is paramount in ensuring that the business requirements are translated into technical outcomes, and that the technical outcomes are translated into business value.
Define Your Objectives Clearly
Clear objectives, and complementarity between the business and data stakeholders, are critical to the success of the project; otherwise, project risk is substantial. This is a two-way street: business stakeholders must recognise the technical constraints and limitations of the data and the project, and data stakeholders must recognise the business context and constraints. This mutual understanding is critical to the success of the project, and to the effective translation of technical outcomes into business value.
Engage In Regular Feedback
For machine learning, a regular cadence of feedback is important to ensure alignment. Check-ins between business and data stakeholders should typically take place weekly or fortnightly, and daily during rapid prototyping sprints. Between check-ins, there should be an incremental delivery of project value. A balanced complementarity of business and data stakeholder inputs is critical to project success.
Be Conscious of Costs
The costs of machine learning projects vary depending on the project scope, the data, and the model. The costs of data acquisition hardware (per sensor), data labelling (if required), and model design can be significant. However, in heavy-industry use cases, the cost of machine learning deployments is often relatively low compared to the capital requirements of property, plant, and equipment. It is important to balance the cost of the project against the expected return on investment, which varies depending on the use case and the business context.
The project cost structure is typically separable into the following categories:
| Cost | Description | Type | Project Stage |
| --- | --- | --- | --- |
| Data Acquisition | The cost of data capture hardware | Hardware | Pre-Analytics |
| Data Labelling | The cost of labelling data | Labour | Pre-Analytics |
| Data Discovery | The cost of the data discovery process | Analytics | Initial Data Analysis (IDA) |
| Data Preprocessing | The cost of preprocessing the data (cleaning, augmentation) | Analytics | Initial Data Analysis (IDA) |
| Data Analysis | The cost of the data analysis process | Analytics | Exploratory Data Analysis (EDA) |
| Data Visualisation | The cost of data visualisation | Analytics | Exploratory Data Analysis (EDA) |
| Data Modelling | The cost of the data modelling process | Analytics | Modelling |
| Model Validation | The cost of validating the model | Analytics | Modelling |
| Model Deployment | The cost of deploying the model | Engineering | Deployment |
| Model Maintenance | The cost of maintaining the model | Engineering | Deployment |
| Data Storage | The cost of data storage | Hardware | All Stages |
| Data Processing | The cost of data processing (CPU & GPU) | Hardware | All Stages |
Data acquisition costs may not apply, depending on the availability of data at the project outset. Labelling data is typically expensive, and its cost is often the most difficult to estimate; we help mitigate this with our automatic labelling techniques.
Project risk can be reduced substantially through a data discovery process, which helps to identify the most promising data sources. Deployment risk can be reduced by ensuring that the project is containerised for accessible deployment to client systems.
7. Consider The Security & Regulatory Implications
Security Risks
With new technology come new risks. Machine learning models are vulnerable to a range of security risks, including:
Adversarial Attacks
Adversarial attacks are a class of attacks in which deliberately crafted, malicious inputs are fed to a machine learning model in order to subvert it into making incorrect predictions.
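As an illustration, the sketch below shows the fast gradient sign method (FGSM), one well-known adversarial attack, in PyTorch; the model is an untrained placeholder, and the example only demonstrates how a small, crafted perturbation is constructed.

```python
# A minimal sketch of the fast gradient sign method (FGSM) in PyTorch.
# The model is an untrained placeholder; the point is only to show how a
# small, crafted perturbation is built from the input gradient.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder classifier
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in input image
y = torch.tensor([3])                             # its true label

# Compute the gradient of the loss with respect to the input.
loss = loss_fn(model(x), y)
loss.backward()

# Nudge every pixel a small step in the direction that increases the loss.
epsilon = 0.05
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

print("prediction on original:   ", model(x).argmax(dim=1).item())
print("prediction on adversarial:", model(x_adv).argmax(dim=1).item())
```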
Confidential Data Leakage
Confidential data leakage attacks are a class of attacks in which confidential data ingested by a machine learning model, during training or inference, is revealed through the predictions the model makes.
Regulatory Compliance
Machine learning models must comply with relevant regulations and laws. This varies by jurisdiction, and by industry.
8. Consider The Model Lifecycle
Deployment & Compatibility With Client Systems
To ensure that the project is compatible with client systems, the project should be containerised. Containerisation enables a project to be deployed onto client systems with minimal effort.
Containerisation reduces the risk of compatibility issues between the client's IT systems and the project. At Traversal Labs, we use Docker to containerise our projects.
Continual Improvement over Time
Machine learning models can, and often should, be configured to improve over time, either in a continuous manner or at planned intervals. This ensures that the model's representation remains consistent with the changing state of the operating environment. An incremental learning regime can help to reduce the risk of model obsolescence.
With such an approach, maintaining traceability of data and model performance is critical. Model performance should be tracked and monitored over time, and alerts and performance control mechanisms should be in place to mitigate performance degradation. Should degradation be detected, automatically or otherwise, the incremental learning system should revert to a validated previous version of the model.
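As a sketch of what such a control mechanism might look like, the snippet below scores the live model on recently labelled data and falls back to the last validated version if performance drops below a floor; the threshold, function name, and metric are hypothetical placeholders for whatever tooling is actually in place.

```python
# An illustrative sketch of a monitoring check with automatic rollback.
# The threshold, function name, and metric are hypothetical placeholders.
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.90  # hypothetical minimum acceptable performance

def evaluate_and_maybe_rollback(candidate_model, previous_model, X_recent, y_recent):
    """Score the live model on recently labelled data; revert if it degrades."""
    score = accuracy_score(y_recent, candidate_model.predict(X_recent))
    if score < ACCURACY_FLOOR:
        # Alert operators and fall back to the last validated model version.
        print(f"performance {score:.3f} below floor {ACCURACY_FLOOR}; rolling back")
        return previous_model
    print(f"performance {score:.3f} acceptable; keeping current model")
    return candidate_model
```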
9. Combine Complementary Modalities
Stacking insights from disparate sources of data is another way of reducing risk. Data spanning multiple modalities can improve predictive outcomes through multisensory consensus: predictions from disparate inputs can be combined, yielding a more robust model that weighs several sources of evidence.
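A simple way to combine modalities is late fusion: train one model per modality and average their predicted probabilities. The sketch below illustrates this with two synthetic stand-in modalities and scikit-learn.

```python
# A minimal sketch of "late fusion": average the predicted probabilities of
# models trained on different modalities (e.g. vibration and temperature
# sensors). The modalities and data here are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=24, random_state=0)
X_vib, X_temp = X[:, :12], X[:, 12:]   # pretend these are two sensor modalities

idx_train, idx_test = train_test_split(np.arange(len(y)), random_state=0)

vib_model = RandomForestClassifier(random_state=0).fit(X_vib[idx_train], y[idx_train])
temp_model = LogisticRegression(max_iter=1000).fit(X_temp[idx_train], y[idx_train])

# Multisensory consensus: combine the two probability estimates.
fused = (vib_model.predict_proba(X_vib[idx_test]) +
         temp_model.predict_proba(X_temp[idx_test])) / 2
accuracy = (fused.argmax(axis=1) == y[idx_test]).mean()
print("fused accuracy:", round(float(accuracy), 3))
```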
Conclusion
We have listed nine ways to reduce risk in machine learning deployments, ranging from data quality to project delivery, and from data complementarity to continuous learning. There are many important factors to consider when deploying machine learning in your organisation. By understanding and managing the potential risks associated with machine learning deployments, industrial organisations can realise the data-driven benefits of machine learning.
At Traversal Labs, we understand the importance of reducing risk in machine learning deployments. We have designed our Data Discovery Process to mitigate the risks outlined above. Our process is targeted towards heavy industry, and is designed to ensure that the data is of high quality and sufficient volume.