Today, more and more products and engineering teams rely on machine learning (referred to as ML throughout this blog post). The abundance of open source tools and libraries also makes it much easier to learn, develop, and build ML models, even for people with little prior knowledge or experience. ML is a powerful tool for many problems, but it comes with costs: it can introduce complexity that builds up over time and turns into large technical debt. A recent publication by Google argues that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying ML (see Reference 1).

At Quora, we've been using ML to tackle many interesting problems such as ranking, search, recommendation, and spam detection (see References 2, 3, and 4). We are constantly evaluating new approaches and building new product features with ML. At the same time, we strive to be careful about the complexity that these models introduce, and we have developed principles and best practices to avoid or reduce it.

In this blog post, we will share our thinking about complexity in ML systems and describe some of our approaches to mitigating it. Note that most of the problems and solutions in this post also apply to general software systems, and vice versa. However, we choose to focus on those that are especially important for ML.

Do you really need machine learning?

Before even thinking about complexity in your ML system, ask yourself whether your product feature actually needs an ML solution. Sometimes ML adds complexity when you could use a simpler heuristic algorithm that does not require feature engineering, model tuning, continuous training, or model deployment. However, when there are already ML models built for other purposes that you can reuse, it is the heuristic that adds complexity. A quick and dirty heuristic might seem like a short-term gain, but it is really a long-term pain. Over time it becomes increasingly difficult to understand, depend on, and maintain all the ad-hoc heuristics. The product can also suffer when there are too many different ways to do similar things, resulting in inconsistent user-facing behavior. Therefore, it's important to be aware of this tradeoff and to consult with your team or the ML specialists in your organization before investing heavily in either approach.

To evaluate whether an ML solution is appropriate for your problem, it is critical that good documentation is kept and shared within the organization. This makes it possible to find out whether product features or problems similar to yours are already tackled using ML. There are also many resources on Quora and elsewhere online about typical problems that can be solved with ML.

Let's take a look at a few examples at Quora. We have developed and productionized ML models for a number of ranking problems such as search result ranking, answer ranking, feed ranking, and digest ranking. In the ranking algorithm, an ML model produces a score that predicts whether a user will “engage” with the ranked result. Although it is not a typical ranking problem, the digest email scheduler can build on a similar ML model to predict the likelihood of a user opening the digest email. In contrast, detecting trending topics or events is often solved using heuristic algorithms that leverage time series analysis.

What is complexity?

Nobody considers complexity a positive feature. However, not everyone agrees on when a system qualifies as complex, or on which tradeoffs are worth making for simplicity. It is important to understand the different symptoms of complexity before we agree on how to treat them. So, what do we mean when we look at an ML system and say it is too complex? Below is a list of possible answers.

1. Too many different ways to do similar things
An ML system is too complex when there are too many different ways to do similar things. This creates complexity in at least two ways. First, engineers lose time trying to figure out the correct way to do what they need to do. Second, because things are implemented in different ways, maintenance overhead is added.

2. Not providing enough explanation or insight
If a system is hard to interpret from the outside and can only be understood as a “black box”, it is generally considered complex.

3. Undocumented functionality
Hard-to-understand, undocumented functionality also creates complexity in a system. The actual implementation might not be that complicated, but the fact that it is hard to understand without digging into the details adds complexity.

4. Non-reusable functionality
Functionality that cannot be reused in different contexts leads to different ways of doing similar things, and therefore adds complexity.

5. Requiring many steps to do a “simple” thing
Sometimes engineers may feel that an existing system or tool requires too many steps or is too complicated for their use cases. In this scenario, they are likely to come up with a brand new system or tool that is optimized for their specific use cases. While this might make the current implementation simpler, adding a different system or tool increases the overall complexity.

6. Requiring an understanding of many tools
Similar to the previous scenario, complexity may arise from imposing the need to understand many tools, or complex ones. For example, if an engineer working on search result ranking needs to understand Python, C++, Gradient Boosted Decision Trees, and Matrix Factorization, and there is no easy way to shield them from having to understand all of these, the system is considered complex.

7. Unnecessary maintenance overhead
A system qualifies as complex if it adds unnecessary maintenance overhead. For example, it might generate pager duty burden or add monitoring and retraining costs.

What pushes engineers to complex solutions?

Engineers do not build complex solutions just for fun, but projects have constraints that might push them to build something unnecessarily complex.

1. Scrappiness
Intuitively, it appears as if scrappiness should lead to a simple solution, since the goal is to get to a solution as soon as possible. However, that is rarely the case. As explained earlier, the fastest solution often leads to a local optimum that neither reuses anything existing nor can be reused in the future. We think that scrappiness is generally good for development velocity, but it is also important to acknowledge its side effects and correct for them.

2. Lack of long-term vision
Engineers might be too focused on developing something for a specific problem, without paying much attention to whether the system will be easy to maintain or can support future use cases.

3. Lack of understanding
Not understanding what the current system does may lead to complexity. There might be a way to implement a new use case easily, but a lack of understanding makes it seem complicated or leads to solutions that are more complex than necessary.

4. Lack of flexibility in architecture
When an existing architecture is not flexible enough to adapt to a new use case, engineers need to decide between changing the existing architecture and doing a “one-off”. More often than not, “one-offs” are preferred because they are easier and quicker to implement.

5. Lack of feature selection
Engineers tend to be more excited about adding new features to the ML model than about removing old ones. Old features may no longer be useful after a certain number of iterations, and they make the model harder to understand and more complex.

6. Optimizing for accuracy
To optimize for accuracy, engineers often use approaches like ensembling, combining results from multiple ML models in the system. While this is usually a good way to improve model quality, “overdoing” it may add significant complexity that cannot be justified by small metric wins.

7. Optimizing for performance
Sometimes optimizing for performance can also lead to overly complicated or obscure system implementations. For example, for performance reasons, engineers working on search may decide to implement the ranking infrastructure in C++ while the rest of the stack is written in Python, which makes the entire system more complex.

8. Dependencies
Building ML systems is hard because there are very few well-known design patterns. In addition, it is very common to have chains of dependencies between data sources and subsystems. It takes an experienced ML engineer to build an efficient yet simple system.
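
To make the "lack of feature selection" cause listed above more concrete, here is a minimal sketch of one way to spot features that a model no longer relies on. It does not describe Quora's tooling. It uses permutation importance, a general technique chosen here purely for illustration: shuffle one feature column at a time and measure how much accuracy drops. Every name in it (`permutation_importance`, `removal_candidates`, the toy `predict` function, the feature names, and the 0.01 threshold) is a hypothetical assumption.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]
Predictor = Callable[[Row], int]


def accuracy(predict: Predictor, rows: Sequence[Row], labels: Sequence[int]) -> float:
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(rows)


def permutation_importance(
    predict: Predictor,
    rows: Sequence[Row],
    labels: Sequence[int],
    n_repeats: int = 5,
    seed: int = 0,
) -> List[float]:
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def removal_candidates(
    importances: Sequence[float], names: Sequence[str], threshold: float = 0.01
) -> List[str]:
    """Features whose shuffling barely hurts accuracy (threshold is an assumption)."""
    return [name for name, imp in zip(names, importances) if imp <= threshold]


# Toy data: the label depends only on the first feature (hypothetical names).
data_rng = random.Random(42)
feature_names = ["answer_quality", "stale_feature"]
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]


def predict(row: Row) -> int:
    """Stand-in for a trained model; it ignores the second feature entirely."""
    return int(row[0] > 0.5)


importances = permutation_importance(predict, rows, labels)
candidates = removal_candidates(importances, feature_names)
print(candidates)  # the feature the toy model ignores should be the only candidate
```

A check like this only flags candidates. Whether a feature can actually be dropped still depends on retraining without it and comparing metrics, since correlated features can hide each other's importance.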